AN ADAPTIVELY WEIGHTED STATISTIC FOR DETECTING
DIFFERENTIAL GENE EXPRESSION WHEN COMBINING 
MULTIPLE TRANSCRIPTOMIC STUDIES 

By Jia Li and George C. Tseng

University of Pittsburgh 

Global expression analyses using microarray technologies are becoming
more common in genomic research; as a result, new statistical challenges
associated with combining information from multiple studies must be
addressed. In this paper we propose an adaptively weighted (AW)
statistic for combining multiple genomic studies to detect differentially
expressed genes. We also compare the proposed AW statistic with
Fisher's equally weighted (EW), Tippett's minimum p-value (minP) and
Pearson's (PR) statistics. Because no uniformly most powerful test
exists, we used a simplified Gaussian scenario to compare the four
methods. The AW statistic consistently produced the best or near-best
power for a range of alternative hypotheses. The AW-obtained weights
have the additional advantage of filtering discordant biomarkers and
providing natural categories of detected genes for further biological
investigation. We demonstrate the superior performance of the proposed
AW statistic through a mix of power analyses, simulations and
applications to data sets from a multi-tissue energy metabolism mouse
model, multi-lab prostate cancer studies and lung cancer studies.

1. Introduction. Integrating results from multiple biological studies is
now commonplace, with significance levels and effect sizes often used in
meta-analyses. Random effects models, which model effect sizes, are
frequently used to address variation in sampling schemes. Differences in
data structures and statistical hypotheses are common across
applications, making direct combination of effect sizes difficult or
impossible. It is often more feasible to combine the transformed
probability integrals of test statistics (usually p-values), since the
procedure depends only on the significance values of the individual
tests and not on the underlying data structures. Fisher's (1932)
well-known method of this type applies a log transformation to the
p-values and sums them with equal weights:
$V^{EW} = -\sum_{k=1}^{K} \log(p_k)$, where $K$ studies are combined and
$p_k$ is the p-value of study $k$, $1 \le k \le K$. Assuming independence
among studies and p-values calculated from correct null distributions in
each study, $2V^{EW}$ follows a Chi-square distribution with $2K$ degrees
of freedom under the null hypothesis. Other transformations considered
previously include the inverse normal [Stouffer et al. (1949)], logit
[Lancaster (1961)] and inverse Chi-square transformation with varying
degrees of freedom [George (1977)], among many others. Although Fisher's
method is not uniformly most powerful, it exhibits good power under
a wide range of conditions. It is also known to be asymptotically
Bahadur optimal (ABO) when the studies share the same effect size under
the alternative hypothesis [Littell and Folks (1971, 1973)]. Different
weights or variations of Fisher's statistic have also been considered.
Good (1955) suggested using unequal weights for individual studies, with
the weights determined by subject experts. More recently, Olkin and
Saner (2001) proposed a trimmed version of Fisher's statistic to remove
the potential effects of aberrant extremes. Another well-known method
for combining p-values is Tippett's (1931) minimum p-value statistic
(minP): $V^{minP} = \min_{1 \le k \le K} p_k$. Wilkinson (1951)
generalized Tippett's procedure to the more robust $r$th smallest
p-value, of which the maximum p-value statistic (maxP),
$V^{maxP} = \max_{1 \le k \le K} p_k$, is widely used. Note that the minP
and maxP statistics align with Roy's (1953) union-intersection test and
Berger's (1982) intersection-union test, respectively. For comprehensive
reviews and comparisons of various meta-analysis approaches, see Hedges
and Olkin (1985) and Cousins (2007).
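
As a concrete illustration of the three p-value combination rules just described, the following minimal Python sketch computes the EW, minP and maxP statistics together with their null p-values. It assumes independent studies and exactly Uniform(0, 1) null p-values, so that the closed-form tails apply; the function names are ours and purely illustrative.

```python
import math


def fisher_pvalue(pvals):
    """Fisher's EW statistic V = -sum(log p_k) and its null p-value.

    Under the stated assumptions, 2V is Chi-square with 2K degrees of
    freedom, i.e. V ~ Gamma(K, 1), whose upper tail has a closed form.
    """
    k = len(pvals)
    v = -sum(math.log(p) for p in pvals)
    tail = math.exp(-v) * sum(v ** i / math.factorial(i) for i in range(k))
    return v, tail


def minp_pvalue(pvals):
    """Tippett's minP statistic; the minimum of K uniforms is Beta(1, K)."""
    v = min(pvals)
    return v, 1.0 - (1.0 - v) ** len(pvals)


def maxp_pvalue(pvals):
    """Wilkinson's maxP statistic; the maximum of K uniforms is Beta(K, 1)."""
    v = max(pvals)
    return v, v ** len(pvals)
```

For a single study ($K = 1$) all three rules return the input p-value unchanged, which is a convenient sanity check.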

Microarray technology supports the examination of the expression of
thousands of genes in parallel. As microarray experiments become more
mature and common, it has become increasingly important to integrate
homogeneous experimental data sets from multiple laboratories and
experimental techniques. In contrast to traditional epidemiological or
evidence-based medical studies, monitoring the expression of thousands
of genes simultaneously presents many challenges to integrative
analysis. In the current biological literature, the term meta-analysis
often refers to naive intersection/union operations or vote counting on
lists of differentially expressed genes obtained from individual studies
using certain criteria, for instance, False Discovery Rate < 0.05
[Borovecki et al. (2005); Cardoso et al. (2007); Pirooznia, Nagarajan
and Deng (2007); Segal et al. (2004), among many others]. Intersections
are too conservative and unions insufficiently conservative, especially
as the value of $K$ increases.

More sophisticated meta-analysis methods can be divided into two
traditions. The first is the use of a summary statistic, that is,
a combination of statistics from individual studies for each gene being
considered, adjusted for multiple comparisons. In many situations, this
type of method is an extension of traditional meta-analysis methods. For
example, Rhodes et al. (2002), who were the first to apply Fisher's
method to microarray data, later introduced a weighted average of test
statistics from individual tests, with weights determined by study
sample sizes [Ghosh et al. (2003)]. Moreau et al. (2003) made use of
Tippett's minimum p-value. A more robust statistic is Wilkinson's $r$th
smallest p-value, of which the maximum p-value can be applied to the
meta-analysis of microarray studies. Owen (2009) reintroduced Pearson's
(1934) method and applied it to the AGEMAP project. He defined the test
statistic as the maximum of Fisher's combination of left-sided and
right-sided p-values. All of these methods combine statistical
significance. Note that when no gene effect exists, the p-value is
uniformly distributed. Accordingly, combining the significance of
independent tests is sometimes called omnibus or nonparametric. When
studies have similar designs and measure the outcomes in similar ways,
combining effect sizes is usually preferred to combining significance.
Choi et al. (2003) used a weighted estimate for individual genes based
on the random effects model (REM) under Gaussian assumptions, and
discussed the details of a Bayesian formulation of the REM. Hu,
Greenwood and Beyene (2005) developed a quality measure for each gene in
individual studies, incorporating the quality index as a weight in the
REM. Hong et al. (2006) proposed a robust rank-based approach for
meta-analysis. Choi et al. (2007) introduced a latent variable approach.

The second meta-analysis tradition is Bayesian; for example, Choi et
al.'s (2003) Bayesian version of the REM, which models the effect sizes.
Similar Bayesian hierarchical models have been suggested by Tseng et al.
(2001) and Conlon, Song and Liu (2006) for incorporating different
levels of replicate information in cDNA microarray experiments. Conlon,
Song and Liu (2007) refer to these models as Bayesian probability
integration (PI) models, and have introduced a Bayesian standardized
expression integration (SEI) model. Instead of modeling study-specific
means separately (PI model), the SEI model treats them as samples from
a normal distribution, thus producing an overall mean and an
inter-study variation. Shen, Ghosh and Chinnaiyan (2004) and Choi et al.
(2007) used a Bayesian mixture model to rescale the individual data sets
and then combined all data sets for an ordinary gene expression
analysis.

The rest of this paper is structured as follows. In Section 2 we
describe two complementary hypothesis settings for detecting
study-invariant and study-specific biomarkers: $HS_A$ and $HS_B$. In
Section 3 we present our proposal for an adaptively weighted (AW)
statistic for meta-analyses of genomic studies, including detailed
descriptions of the AW statistic algorithm and a permutation test for
combining multiple studies. In Section 4 we evaluate the proposed method
in a simulation study and in applications to data sets from studies of
a multi-tissue energy metabolism mouse model, prostate cancer and lung
cancer; we then compare our results with those produced by other
commonly used methods. In Section 5 we demonstrate the admissibility and
power of the proposed AW test under a Gaussian assumption, and in
Section 6 we summarize its statistical advantages and limitations.

2. Two major complementary hypothesis settings. To our knowledge, no
comprehensive evaluation of the above-described meta-analysis methods
has been performed, primarily because of a lack of rigorous formulation
of the statistical hypotheses. Here we consider a meta-analysis of $K$
gene expression profile studies, $D_1, D_2, \ldots, D_K$. Let $x_{kgs}$
be the expression intensity of gene $g$ and sample $s$ in study $k$,
with samples $s = 1, \ldots, n_k$ belonging to a control group (e.g.,
normal samples) and $s = n_k + 1, \ldots, n_k + m_k$ belonging to the
diseased group (e.g., cancer samples). Normally the null hypothesis for
each gene $g$ is considered as

$H_0$: $\theta_{g1} = \cdots = \theta_{gK} = 0$,

where $\theta_{gk}$ represents the effect of gene $g$ in study $k$.
Building on Birnbaum's (1954) work, the complementary hypothesis
settings ($HS_A$ and $HS_B$) depend on the nature of the experiments in
which the gene effects ($\theta_{gk}$) are obtained:

$HS_A$: $\{H_0$ versus $H_A$: $\theta_{gk} \ne 0$, $\forall\, 1 \le k \le K\}$,

$HS_B$: $\{H_0$ versus $H_B$: at least one $\theta_{gk} \ne 0$, $1 \le k \le K\}$.

Different methods explicitly or implicitly consider different subsets or
variations of the two alternative hypotheses:

$HS_{A1}$: $\{H_0$ versus $H_{A1}$: $\theta_g = \theta_{g1} = \cdots = \theta_{gK} \ne 0\}$,

$HS_{A2}$: $\{H_0$ versus $H_{A2}$: $\theta_g \ne 0$, $\theta_{gk} \sim N(\theta_g, \tau^2)\}$,

$HS_{Bh}$: $\{H_0$ versus $H_{Bh}$: $\sum_{k=1}^{K} I(\theta_{gk} \ne 0) = h$ $(1 \le h \le K)\}$

[$I(\cdot)$ is an indicator function that equals 1 when the statement is
true and 0 otherwise].

Without danger of confusion, here we will use the hypothesis notation
($H_A$, $H_B$ and so on) to also denote the parameter space of the
corresponding alternative hypothesis. It is clearly seen that
$H_A \subset H_B$. However, they represent two families of complementary
interpretations in applications. Under $H_A$, gene $g$ is identified
only when it is differentially expressed in all studies. Under $H_B$,
gene $g$ is selected if it is differentially expressed in one or more
studies. Note that $H_{A1} \subset H_A$, representing an equal fixed
effect model. $H_{A2}$ represents a random effects model for a similar
$H_A$ purpose, while $H_{A2} \not\subset H_A$ in general. Note also that
$H_B = \bigcup_{1 \le h \le K} H_{Bh}$, $H_{Bh'} \subset H_{Bh}$
$(1 \le h \le K)$ and $H_{BK'} = H_{A1}$.
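
To make the relationship between these parameter spaces concrete, the short Python sketch below reports, for a hypothetical vector of gene effects $(\theta_{g1}, \ldots, \theta_{gK})$, the number $h$ of nonzero effects and whether the vector lies in $H_0$, $H_A$ or $H_B$. The function name and the tolerance argument are illustrative assumptions, not part of the method.

```python
def classify_effects(theta, tol=0.0):
    """Locate an effect vector (theta_g1, ..., theta_gK) in the hypothesis spaces.

    h counts the nonzero effects, so H_Bh is the set of vectors with that h;
    H_A requires h == K and H_B requires h >= 1 (hence H_A is inside H_B).
    """
    k = len(theta)
    h = sum(1 for t in theta if abs(t) > tol)
    return {"h": h, "H0": h == 0, "HA": h == k, "HB": h >= 1}
```

For example, `classify_effects([1.2, 0.0, -0.7])` gives `h = 2`, which belongs to $H_B$ (specifically $H_{B2}$) but not to $H_A$.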

From a biological standpoint, the experimental design and the objectives
of the meta-analysis determine the biomarker lists of interest. To
illustrate this idea, we will use three sets of microarray studies for
meta-analyses. The first set consists of two mouse genotypes, wild type
(VLCAD +/+) and VLCAD deficient (VLCAD -/-), with four mice in each
genotype group (VLCAD deficiency is associated with a childhood
metabolism disorder). Brown fat, liver and heart tissue samples were
collected from each of the eight individual mice and used for microarray
experiments designed to study global expression changes caused by the
knock-out of VLCAD (Table 1, left). Given the experimental design,
a biomarker list of interest might consist of those genes that are
consistently differentially expressed between wild type and
VLCAD-deficient mice in all tissues. This type of tissue-invariant (or
study-invariant) biomarker list can be loosely defined as $G_A$, with
analysis based on the alternative hypothesis family $H_A$. However, it
is also reasonable to assume that tissue-specific physiology triggers
tissue-dependent responses, with pools of differentially expressed genes
specific to the tissues in question. Such a hypothesis would focus on
signature genes that are differentially expressed in subsets of one or
more tissues, an analysis that corresponds to the $H_B$ alternative
hypothesis family. Hereafter we will use the term $G_B$ when addressing
such tissue-specific or study-specific biomarker lists. In the second
study set, microarray comparisons of normal versus prostate tumor
tissues were performed by three different research teams: Dhanasekaran
et al. (2001), Luo et al. (2001) and Welsh et al. (2001) (Table 1). The
$G_A$ study-invariant biomarker list is clearly of greater biological
interest in this situation, since many of the $G_B$ study-specific
biomarkers represent experimental and technical discrepancies between
studies, possibly due to sample population heterogeneity, gene matching
errors or differences in experimental protocols. Further investigation
of study-specific biomarkers may provide technical insights into
experimental design features without providing biological insights into
the disease of interest. The third set of microarray studies
[Bhattacharjee et al. (2001); Beer et al. (2002); Garber et al. (2001)]
included analyses of lung cancer samples and a comparison of normal
versus adenocarcinoma samples. Table 1(C) shows the pair-wise
integrative correlation coefficients
[Parmigiani et al. (2004)] in each of the three examples. A review of
past meta-analyses reveals that lung cancer studies generally have
larger samples, greater homogeneity and better data quality than
prostate cancer studies, especially in terms of biomarker detection and
classification analysis.

Table 1
Three sets of microarray studies for meta-analyses. (BF: brown fat; Liv: liver; Ht: heart; WT: wild type (VLCAD +/+); VLCAD: VLCAD -/-; N: normal; T: tumor; AC: adenocarcinomas)

(A)
Mouse energy metabolism      Prostate cancer studies      Lung cancer studies
         BF    Liv   Ht               Dhan  Luo   Wels             Bhat  Beer  Garb
WT       4     4     3       N        19    9     9       N        17    10    5
VLCAD    4     4     4       T        14    16    25      AC       134   86    39

(B)
         Mouse energy metabolism   Prostate cancer studies          Lung cancer studies
$HS_A$   Of biological interest    Of biological interest           Of biological interest
$HS_B$   Of biological interest    Of less biological interest      Of less biological interest
                                   but of more technical interest   but of more technical interest

(C)
         BF    Liv   Ht               Dhan  Luo   Wels             Bhat  Beer  Garb
BF       1     0.06  0.04    Dhan     1     0.05  0.09    Bhat     1     0.33  0.22
Liv      0.06  1     0.03    Luo      0.05  1     0.09    Beer     0.33  1     0.15
Ht       0.04  0.03  1       Wels     0.09  0.09  1       Garb     0.22  0.15  1

Table 2
Meta-analysis methods, corresponding hypothesis settings and targeted types of biomarker list

                                                                 Alternative   Targeted
Methods                                           Abbreviation   hypothesis    biomarker list
Fisher [equally weighted sum of log(p-values)]    EW             $H_B$         $G_B$
Tippett (minimum p-value)                         minP           $H_B$         $G_B$
Pearson (maximum of Fisher's left-sided and       PR             $H_B$         $G_B$
  right-sided score)
Li and Tseng [adaptively weighted sum of          AW             $H_B$         $G_B$
  log(p-values)]
Wilkinson (maximum p-value)                       maxP           $H_A$         $G_A$
Choi (2003); Shen (2004); Choi (2007)             REM            $H_{A2}$      $G_A$
  (random effects model)
Conlon (2006) (PI Bayesian approach)              PI             NA            $G_A$
Conlon (2007) (SEI Bayesian approach)             SEI            NA            $G_A$


Table 2 presents a list of commonly used meta-analysis methods for
microarray studies, their corresponding alternative hypotheses and
targeted biomarkers. While both the Bayesian SEI and PI methods tend to
detect $G_A$-type biomarkers across studies, the Bayesian concept does
not involve hypothesis testing. Note that different approaches have
distinctly different advantages and disadvantages in terms of parameter
space subsets in the alternative hypotheses, even though two methods may
be designed for the same hypothesis. For example, to detect $G_A$ genes,
PI performs better than SEI for genes that have a high mean effect in
one study but a low mean effect in another. According to Loughin (2004),
maxP is generally under-powered, but performs well when all
$\theta_{gk}$ values are nonzero and roughly the same. As we will show
in Section 5, EW, minP, PR and AW are all admissible for detecting $G_B$
genes. For $H_{Bh}$, EW tends to be more powerful when $h$ is large and
closer to $K$. Littell and Folks proved that EW is ABO when detecting
$G_A$-type genes under $H_{BK'}$ (i.e., $H_{A1}$), even though the EW
statistic is targeted toward the general $H_B$. In contrast, minP is
more powerful in detecting genes under $H_{Bh}$ when $h$ is small.

From this point forward, our focus will be on the $H_B$ alternative
hypothesis. In the following section we will describe our proposal for
an adaptively weighted statistic (AW), and, in Section 5, we will
demonstrate its robustness and near-optimal power for alternative
hypotheses at either extreme (i.e., when $h$ is close to $K$ or close to
1 in $H_{Bh'}$). We will also give examples of situations in which AW
outperforms EW and minP in intermediate scenarios. AW is capable of
distinguishing $G_A$ and $G_B \setminus G_A$ genes in a manner that
indicates in which study or studies individual biomarkers are
differentially expressed, information that is useful for
post-meta-analysis investigations.

3. Adaptively-weighted statistic. When integrating multiple genomic
studies, the expression of some important biomarkers may be altered in
a study-specific manner (consider $H_B$). To uncover altered gene
expression patterns across studies, we start with the following weighted
statistic:

(3.1)    $U_g(w_g) = -\sum_{k=1}^{K} w_{gk} \log(p_{gk})$,

where $p_{gk}$ is the p-value of gene $g$ in study $k$, $w_{gk}$ is the
weight assigned to the $k$th study and $w_g = (w_{g1}, \ldots, w_{gK})$.
Under the null hypothesis that $\theta_{gk} = 0$, $\forall k$, the
p-value of the observed weighted statistic, $p_U(u_g(w_g))$, can be
obtained for a given gene $g$ and weight $w_g$ (see below for the
detailed permutation algorithm used to calculate the p-value). The
adaptively-weighted statistic is defined as the minimal p-value among
all possible weights:

(3.2)    $V_g^{AW} = \min_{w_g \in W} p_U(u_g(w_g))$,

where $u_g(w_g)$ is the observed statistic for $U_g(w_g)$, and $W$ is
a prespecified search space. Our choice of search space in this paper is
$W = \{w \mid w_i \in \{0, 1\}\}$, which results in an affordable
computation of $O(2^K - 1)$, given that $K < 10$ is the norm in
a microarray meta-analysis.

The resulting weight has a natural biological interpretation: it
indicates whether or not a study contributes to the statistical
significance of a gene. Note that the AW statistic is inadequate for
traditional meta-analysis in epidemiological or evidence-based medicine
research, because the AW selection procedure introduces selection bias
toward studies with concordant significant effects. However, integrative
analysis of genomic studies represents a different situation: usually
the primary goal is to screen and identify the most probable gene
markers, given the data, in order to facilitate future investigation. As
we will show in Section 4, the weight vector,
$w_g^* = \arg\min_{w_g \in W} p_U(u_g(w_g))$, actually serves as
a convenient basis for gene categorization in follow-up biological
interpretations and explorations.

Below we illustrate the detailed procedure for AW when applied to
combining genomic studies. If we assume $p_{gk} \sim \mathrm{Unif}[0, 1]$
under the null hypothesis, then
$U_g(w_g) \sim \mathrm{Gamma}(\sum_{k=1}^{K} w_{gk}, 1)$ and inference on
the AW statistic can be performed on this basis. Such a uniform p-value
assumption is, however, usually not true in real applications. Instead,
a permutation test is performed below to assess statistical
significance, and the false discovery rate (FDR) is controlled at 5%.
For the applications in Section 4, the EW, minP, maxP and PR methods are
performed using a similar permutation test.
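
The parametric shortcut mentioned above can be sketched in a few lines of Python. The sketch assumes exactly uniform, independent null p-values, so that $p_U$ is the upper tail of a Gamma distribution with integer shape $\sum_k w_{gk}$; it enumerates the $2^K - 1$ nonzero binary weight vectors of the search space $W$ and returns the minimizing weight. The function names are ours. Note that the returned minimum is not itself a valid p-value, because the weight was selected to make it small; it still has to be calibrated, which is what the permutation procedure below does.

```python
import itertools
import math


def erlang_tail(u, m):
    """P(Gamma(m, 1) >= u) for integer shape m >= 1 (closed form)."""
    return math.exp(-u) * sum(u ** i / math.factorial(i) for i in range(m))


def aw_parametric(pvals):
    """Return (V, w*) for one gene under the uniform-null assumption.

    V is the smallest p_U(u(w)) over all nonzero 0/1 weight vectors w.
    """
    best_p, best_w = None, None
    for w in itertools.product((0, 1), repeat=len(pvals)):
        m = sum(w)
        if m == 0:
            continue  # the all-zero weight carries no information
        u = -sum(wk * math.log(p) for wk, p in zip(w, pvals))
        p_u = erlang_tail(u, m)
        if best_p is None or p_u < best_p:
            best_p, best_w = p_u, w
    return best_p, best_w
```

For instance, with study-wise p-values (0.46, 0.0009, 0.011) the search selects the weight (0, 1, 1): the first study is dropped because including it dilutes the evidence from the other two.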

I. Study-wise p-value calculation before meta-analysis:

(1) Compute the penalized t-statistics, $t_{gk}$, for gene $g$ and study $k$
[Efron et al. (2001); Tusher, Tibshirani and Chu (2001)].

(2) Permute group labels in each study $B$ times, and similarly
calculate the permuted statistics, $t_{gk}^{(b)}$, where
$1 \le g \le G$, $1 \le k \le K$, $1 \le b \le B$.

(3) Estimate the p-value of $t_{gk}$ as
$p_{gk} = \sum_{b=1}^{B} \sum_{g'=1}^{G} I\{t_{g'k}^{(b)} \in R(t_{gk})\} / (B \cdot G)$,
where $R(t_{gk})$ is the rejection region given the threshold $t_{gk}$.
Similarly, given $t_{gk}^{(b)}$, compute
$p_{gk}^{(b)} = \sum_{b'=1}^{B} \sum_{g'=1}^{G} I\{t_{g'k}^{(b')} \in R(t_{gk}^{(b)})\} / (B \cdot G)$.

II. Calculate AW statistic:

(1) Given a weight $w_g = (w_{g1}, \ldots, w_{gK})$, the weighted
statistic is defined as
$u_g(w_g) = -\sum_{k=1}^{K} w_{gk} \log(p_{gk})$ for gene $g$. Define
$u_g^{(b)}(w_g) = -\sum_{k=1}^{K} w_{gk} \log(p_{gk}^{(b)})$.

(2) Estimate the p-value of the observed $u_g(w_g)$ as
$p_U(u_g(w_g)) = \sum_{b=1}^{B} \sum_{g'=1}^{G} I\{u_{g'}^{(b)}(w_g) \ge u_g(w_g)\} / (B \cdot G)$.
Similarly compute
$p_U(u_g^{(b)}(w_g)) = \sum_{b'=1}^{B} \sum_{g'=1}^{G} I\{u_{g'}^{(b')}(w_g) \ge u_g^{(b)}(w_g)\} / (B \cdot G)$.

(3) Based on II(1) and II(2), calculate the optimal weight as
$w_g^* = \arg\min_{w_g \in W} p_U(u_g(w_g))$
and, similarly,
$w_g^{(b)*} = \arg\min_{w_g \in W} p_U(u_g^{(b)}(w_g))$.
Define the AW statistic $V_g$ as the p-value of the adaptively weighted
statistic: $V_g = p_U(u_g(w_g^*))$. Similarly,
$V_g^{(b)} = p_U(u_g^{(b)}(w_g^{(b)*}))$.

III. Assess p-values and q-values of the AW statistic $V_g$:

(1) The p-value of $V_g$ is calculated as
$p_V(V_g) = \sum_{b=1}^{B} \sum_{g'=1}^{G} I\{V_{g'}^{(b)} \le V_g\} / (B \cdot G)$.

(2) Estimate $\pi_0$, the proportion of null genes, as
$\hat{\pi}_0 = \sum_{g=1}^{G} I\{p_V(V_g) \in A\} / (G \cdot \ell(A))$
[Storey (2002)]. Normally we choose $A = [0.5, 1]$ and $\ell(A) = 0.5$.

(3) Estimate the q-value, $q_V(V_g)$, for each gene.
The detected gene list is $G^{AW} = \{g : q_V(V_g) < 0.05\}$.

IV. Distinguish concordant and discordant genes (recommended): Split the
detected gene list into concordant and discordant gene lists. By
controlling the false discovery rate (FDR) at 5%, detected genes with
concordant regulation direction across contributing studies are denoted
as $G^{AW}_{concordant} = \{g : q_V(V_g) < 0.05$ and
$|\sum_{k=1}^{K} \mathrm{sgn}(t_{gk}) \cdot w_{gk}^*| = \sum_{k=1}^{K} w_{gk}^*\}$,
where $\mathrm{sgn}(\cdot)$ is the sign function that takes value 1 when
positive and -1 when negative. The discordant gene list is
$G^{AW}_{discordant} = G^{AW} \setminus G^{AW}_{concordant}$.
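
Steps II and III(1) can be sketched as follows. This is a minimal, unoptimized Python illustration rather than the authors' implementation: it assumes the study-wise p-values from step I are already available as nested lists (`p_obs[g][k]` for the observed data, `p_perm[b][g][k]` for permutation `b`), that all of them are strictly positive so the logarithm is defined, and that ties between weights are broken by keeping the first weight found. All names are illustrative.

```python
import itertools
import math
from bisect import bisect_left, bisect_right


def weighted_stat(p_row, w):
    """u(w) = -sum_k w_k * log(p_k) for one gene (step II(1))."""
    return -sum(wk * math.log(p) for wk, p in zip(w, p_row))


def aw_permutation(p_obs, p_perm):
    """Return (p_V, w_star): AW p-value and optimal weight for each gene."""
    n_genes, n_studies, n_perm = len(p_obs), len(p_obs[0]), len(p_perm)
    weights = [w for w in itertools.product((0, 1), repeat=n_studies) if any(w)]
    best = [(2.0, None)] * n_genes                          # observed (V_g, w*_g)
    v_perm = [[2.0] * n_genes for _ in range(n_perm)]       # V_g^(b)
    for w in weights:
        # Null distribution of u(w), pooled over permutations and genes.
        null = sorted(weighted_stat(p_perm[b][g], w)
                      for b in range(n_perm) for g in range(n_genes))
        n_null = len(null)

        def p_u(u):
            # Fraction of pooled null statistics >= u (step II(2)).
            return (n_null - bisect_left(null, u)) / n_null

        for g in range(n_genes):
            p = p_u(weighted_stat(p_obs[g], w))
            if p < best[g][0]:
                best[g] = (p, w)                            # step II(3)
            for b in range(n_perm):
                pb = p_u(weighted_stat(p_perm[b][g], w))
                if pb < v_perm[b][g]:
                    v_perm[b][g] = pb
    null_v = sorted(v for row in v_perm for v in row)
    # Fraction of permuted AW statistics <= observed V_g (step III(1)).
    p_v = [bisect_right(null_v, v) / len(null_v) for v, _ in best]
    return p_v, [w for _, w in best]
```

Sorting each pooled null distribution once and using binary search keeps the cost at roughly $O(2^K \cdot B \cdot G \log(B \cdot G))$, which is manageable for the small $K$ considered here.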



Remarks. 

1. For the application of the EW, minP, maxP and PR methods, steps
II(1)-II(3) can be skipped. Instead, the test statistic is defined as
$V_g = -\sum_{k=1}^{K} \log(p_{gk})$ for EW;
$V_g = \min_{1 \le k \le K} p_{gk}$ for minP;
$V_g = \max_{1 \le k \le K} p_{gk}$ for maxP and
$V_g = \max(-\sum_{k=1}^{K} \log(p_{gk}), -\sum_{k=1}^{K} \log(1 - p_{gk}))$
for PR, where $p_{gk}$ is the one-sided p-value for gene $g$ in study $k$.

2. The I-III sequence provides an algorithm for a general framework. Both
the statistic $t_{gk}$ and the rejection region $R(t_{gk})$ can be
replaced, depending on the experimental design and hypothesis. For
example, the F-statistic can be used when multiple groups of samples are
available in each study under consideration.

3. When conducting comparisons of two groups and applying the moderated
t-statistic, genes detected under the general framework (the I-III
sequence) may contain discordant genes, for instance, a gene
up-regulated in one study and down-regulated in another; the addition of
step IV provides further filtering. In some applications, a researcher
may want to scrutinize the discordant gene list to verify whether the
discordance reflects actual biological discrepancy across studies (e.g.,
different tissues or patient populations) or artificial errors (e.g.,
mistakes in gene annotation). For EW and minP there is no direct
criterion for a clear split of concordant and discordant genes. After
revisiting the PR method for the AGEMAP project, Owen found that it is
sensitive to consistent left- or right-sided departures. The PR method
is still easily dominated by one or two exceptionally significant
p-values, and does not identify which studies are significant in
distinguishing between concordant versus discordant patterns (see first
two examples in Table 6).
4. Several forms of penalized or moderated t-statistics have been
proposed and shown to outperform traditional t-statistics [Efron et al.
(2001); Tusher, Tibshirani and Chu (2001); Smyth (2004)]. For our
algorithm we choose the penalized t-statistics used in Efron et al.
(2001) and Tusher, Tibshirani and Chu (2001). The fudge parameter $s_0$
is chosen to be the median variability estimator in the genome.
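
The penalized t-statistic of Remark 4 can be sketched as below. The sketch assumes the commonly used form in which the difference of group means is divided by the pooled standard error plus the fudge parameter $s_0$, with $s_0$ taken as the median of the per-gene standard errors; the sign convention (case minus control) and the function names are our assumptions.

```python
import math
import statistics


def pooled_se(x, y):
    """Pooled standard error of the difference in means of two groups."""
    n1, n2 = len(x), len(y)
    m1, m2 = statistics.mean(x), statistics.mean(y)
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    return math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))


def penalized_t(control, case):
    """Penalized t-statistic for every gene in one study.

    control[g] and case[g] hold the expression values of gene g in the
    two groups; s0 is the median of the per-gene standard errors.
    """
    se = [pooled_se(x, y) for x, y in zip(control, case)]
    s0 = statistics.median(se)
    return [(statistics.mean(y) - statistics.mean(x)) / (s + s0)
            for x, y, s in zip(control, case, se)]
```

Adding $s_0$ to the denominator shrinks the statistics of genes with very small estimated variance, which would otherwise dominate the ranking.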

4. Applications. 

4.1. Simulation study. We conducted a simulation study combining four
data sets to compare the performance of our proposed AW test, Fisher's
EW test, Tippett's minP method, Wilkinson's maxP method and Pearson's
statistic (PR). For each data set, we simulated five normal samples from
a standard normal distribution and five case samples from
$N(\theta, 1)$. A total of $g_1$ genes (category I) were differentially
expressed across all four data sets; $g_2 = 400 - g_1$ genes were
differentially expressed in the fourth data set only (category II); and
1600 genes were considered null. Genes were called significant by
controlling the FDR at 5% for each method. Each simulation scenario was
repeated 1000 times.
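
One replicate of this simulation design can be generated with the following Python sketch. It assumes that null genes, and category II genes outside the fourth data set, have case samples drawn from the standard normal distribution; the function name, argument names and seed handling are illustrative.

```python
import random


def simulate(g1, theta, seed=0, n_studies=4, n_per_group=5,
             n_de=400, n_null=1600):
    """Simulate one replicate: gene labels plus (control, case) per study.

    Category I genes are shifted by theta in every study, category II
    genes only in the last study, and null genes in none.
    """
    rng = random.Random(seed)
    labels = ["I"] * g1 + ["II"] * (n_de - g1) + ["null"] * n_null
    studies = []
    for k in range(n_studies):
        control, case = [], []
        for lab in labels:
            is_de = lab == "I" or (lab == "II" and k == n_studies - 1)
            mu = theta if is_de else 0.0
            control.append([rng.gauss(0.0, 1.0) for _ in range(n_per_group)])
            case.append([rng.gauss(mu, 1.0) for _ in range(n_per_group)])
        studies.append((control, case))
    return labels, studies
```

Calling `simulate(200, 2.0)` corresponds to the scenario with 200 category I and 200 category II genes at effect size $\theta = 2$.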

Summaries of the resulting FDR and the average number of genes
identified in each category under three different scenarios appear in
the following tables: 0 category I and 400 category II genes in Table 3;
200 category I and 200 category II genes in Table 4; 400 category I and
0 category II genes in Table 5. The results are consistent with the
power calculation discussed in Section 5.1. In Table 3, minP is much
more powerful than EW. When $\theta = 2$, minP correctly detects an
average of 41.6 genes and EW detects only 7.6 genes. AW detects 32.1
genes, reasonably close to minP. Similarly, in Table 5, EW (386.8 genes
detected when $\theta = 1.5$) is more powerful than minP (121.3 genes
detected), and AW (359.3 genes detected) is close to EW in performance.
Overall, AW performance was stable in these extreme situations. We note
that most methods show an FDR close to 5%, although maxP loses so much
power in the first scenario that its FDR is inflated, and the PR method
appears slightly conservative.

Table 3
Evaluation of AW, EW, minP, maxP and PR methods by simulations in the first scenario (I. 0 common DE genes; II. 400 4th-data set-specific DE genes; Null. 1600 random noise genes). Average number of genes detected in each category and the average FDR are shown under different effect size $\theta$

          $\theta = 2.0$                          $\theta = 2.5$
Methods   I     II     Null   FDR (s.e.)          I     II      Null   FDR (s.e.)
AW        0.0   32.1   1.9    4.8% (0.002)        0.0   137.1   7.5    4.9% (0.001)
EW        0.0   7.6    0.4    4.1% (0.003)        0.0   43.1    2.0    4.2% (0.002)
minP      0.0   41.6   2.4    5.0% (0.002)        0.0   163.0   8.7    4.9% (0.001)
maxP      0.0   0.2    0.1    25.5% (0.013)       0.0   0.2     0.1    25.5% (0.013)
PR        0.0   3.2    0.1    3.7% (0.004)        0.0   15.2    0.4    2.2% (0.002)

Table 4
Evaluation of AW, EW, minP, maxP and PR methods by simulations in the second scenario (I. 200 common DE genes; II. 200 4th-data set-specific DE genes; Null. 1600 random noise genes). Average number of genes detected in each category and the average FDR are shown under different effect size $\theta$

          $\theta = 1.5$                          $\theta = 2.0$
Methods   I       II     Null   FDR (s.e.)        I       II     Null   FDR (s.e.)
AW        169.1   24.3   10.1   4.9% (0.0005)     198.7   59.4   13.4   4.9% (0.0004)
EW        188.4   16.9   8.5    4.0% (0.0004)     199.8   35.4   9.5    3.9% (0.0004)
minP      25.4    6.9    1.9    5.0% (0.0016)     144.0   54.7   10.3   4.9% (0.0005)
maxP      168.3   3.7    8.4    4.6% (0.0005)     195.7   4.4    9.8    4.7% (0.0005)
PR        178.7   9.4    3.8    2.0% (0.0003)     199.3   21.3   4.3    1.9% (0.0003)
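
The 5% FDR threshold used throughout these comparisons relies on q-values. As an illustration only, the Python sketch below estimates the null proportion with $A = [0.5, 1]$ and $\ell(A) = 0.5$ and then converts a list of p-values into q-values using the standard p-value-based form of Storey's procedure. This form is an assumption made for the sketch; the permutation-based estimator applied to the AW statistic may differ in detail, and the function name is ours.

```python
def storey_qvalues(pvals, lam=0.5):
    """Return (pi0, q) where q[i] is the q-value of pvals[i].

    pi0 is the share of p-values in A = [lam, 1] divided by the length of
    A; q-values are the running minimum of pi0 * m * p / rank taken from
    the largest p-value downwards, which makes them monotone in p.
    """
    m = len(pvals)
    pi0 = min(1.0, sum(p >= lam for p in pvals) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pi0 * m * pvals[i] / rank)
        q[i] = running
    return pi0, q
```

A gene list at 5% FDR is then `[g for g, qg in enumerate(q) if qg < 0.05]`.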

4.2. Energy metabolism in mouse model. An energy metabolism disorder
in children is associated with very long-chain acyl-coenzyme
A dehydrogenase (VLCAD) deficiencies. In an ongoing unpublished project,
two genotypes of the mouse model, wild type (VLCAD +/+) and
VLCAD-deficient (VLCAD -/-), were studied for three types of tissues
(brown fat, liver and heart) with 4 mice in each genotype group.
Microarray experiments were applied separately to study the expression
changes across genotypes.

Table 5
Evaluation of AW, EW, minP, maxP and PR methods by simulations in the third scenario (I. 400 common DE genes; II. 0 4th-data set-specific DE genes; Null. 1600 random noise genes). Average number of genes detected in each category and the average FDR are shown under different effect size $\theta$

          $\theta = 1.5$                          $\theta = 2.0$
Methods   I       II    Null   FDR (s.e.)         I       II    Null   FDR (s.e.)
AW        359.3   0.0   18.6   4.9% (0.0004)      398.5   0.0   20.4   4.8% (0.0004)
EW        386.8   0.0   15.9   4.0% (0.0003)      399.8   0.0   16.1   3.9% (0.0003)
minP      121.3   0.0   6.3    4.8% (0.0007)      329.5   0.0   16.8   4.8% (0.0004)
maxP      357.5   0.0   19.0   5.0% (0.0004)      394.9   0.0   21.3   5.1% (0.0004)
PR        373.9   0.0   7.5    2.0% (0.0002)      399.4   0.0   7.8    1.9% (0.0002)



Table 6
Five genes from the mouse energy metabolism data. Moderated t-statistics and p-values for individual studies are listed. $w^*$ represents AW-obtained weight. AW2 represents AW concordant method

               Moderated t-statistic (p-value)                 Is it detected ($q(V) < 5\%$)?
Gene           Brown fat       Liver           Heart           EW    minP   PR    AW    AW2   Concordant?
1423407_a_at   2.2 (0.0027)    1.7 (0.0027)    -3.7 (0.0014)   yes   no     yes   yes   no    no
  $w^*$        1               1               1
1418429_at     3.6 (0.0003)    1.1 (0.067)     -3.2 (0.002)    yes   no     yes   yes   no    no
  $w^*$        1               0               1
1449015_at     0.4 (0.46)      -3.3 (0.0009)   -1.8 (0.011)    yes   no     yes   yes   yes   yes
  $w^*$        0               1               1
1416415_a_at   -0.8 (0.15)     2.2 (0.0026)    2.6 (0.0023)    yes   no     yes   yes   yes   yes
  $w^*$        0               1               1
1415727_at     -1.5 (0.018)    -1.6 (0.014)    -3.5 (0.0008)   yes   no     yes   yes   yes   yes
  $w^*$        1               1               1

[Figure 1: heatmap panels (a) AW (general) and (b) AW (concordant genes); columns in each panel are brown fat, liver and heart, each split into WT and VLCAD.]

Fig. 1. Heatmaps of gene expressions for differentially expressed genes identified by
different methods in the mouse energy metabolism data sets.

In this study we tested the hypothesis that tissue-specific physiology
triggers tissue-dependent responses, with precise pools of
differentially expressed genes specific to the tissue in question. The
purpose is to identify signature genes that are significant for tissue
subsets, an analysis that corresponds to $HS_B$.

Because maxP has low power, Figure 1 is limited to the AW, EW, minP and PR
methods. Note that EW, minP and AW are based on summarizing p-values across
studies, so these methods alone do not distinguish discordant genes, that
is, genes whose direction of differential expression differs across studies
(e.g., up-regulated in one study but down-regulated in another). The
modified AW algorithm for filtering out discordant genes (Section 3,
step IV) can be used in such situations, since it discards genes that are
discordant among the studies contributing to the adaptive weight. This
modification is not applicable to EW, minP and PR because those methods do
not indicate which studies should be considered when evaluating concordance
or discordance.
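
The concordance check described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: Section 3,
step IV is not reproduced here, so we assume that a gene counts as
concordant when the signed statistics of all studies with adaptive weight 1
share the same sign. The function name `is_concordant` is hypothetical; the
statistics and weights are those listed in Table 6.

```python
def is_concordant(stats, weights):
    """Return True when all studies with weight 1 share the same sign.

    Assumption: only studies that contribute to the adaptive weight
    (weight == 1) are inspected; studies with weight 0 are ignored.
    """
    signs = {s > 0 for s, w in zip(stats, weights) if w == 1}
    return len(signs) <= 1


# Moderated t-statistics in the order (brown fat, liver, heart), from Table 6.
print(is_concordant([2.2, 1.7, -3.7], [1, 1, 1]))    # 1423407_a_at -> False
print(is_concordant([0.4, -3.3, -1.8], [0, 1, 1]))   # 1449015_at   -> True
```

The second gene is kept even though its brown fat statistic has the opposite
sign, because brown fat does not contribute to its adaptive weight.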

Overall, the general AW method detected 203 genes [Figure 1(a)]; among
these, 28 genes showed conflicting up- or down-regulation across tissues.
Figure 1(b) shows the remaining 175 genes detected after concordance
filtering. Adaptive weights serve as a natural grouping of the identified
genes: 55 genes with weights of (1,1,1) were differentially expressed in
all three tissue types [Figure 1(b)], and 27 genes with weights of (0,1,1)
were differentially expressed in liver and heart tissues, but not in brown
fat. The number of detected genes related to heart tissue [(1,1,1),
(1,0,1), (0,1,1) and (0,0,1) in Figure 1(b)] is much higher than that
related to brown fat or liver tissues, reflecting an increased impact of
VLCAD deletion on heart metabolism. EW detected more genes (329) than our
proposed AW method [Figure 1(c)]. However, the identified gene list is
difficult to interpret and investigate, even after reordering by
hierarchical clustering. In this application minP appears to be much less
powerful.
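
As a small illustration of how adaptive weights group the detected genes,
the following Python sketch collects genes by weight vector and counts
those related to heart tissue. The helper names are hypothetical, and the
three example genes and their weights (in the order brown fat, liver,
heart) are taken from Table 6; the sketch is not the code used for Figure 1.

```python
from collections import defaultdict


def group_by_weights(weights_by_gene):
    """Map each weight vector to the list of genes that carry it."""
    groups = defaultdict(list)
    for gene, weights in weights_by_gene.items():
        groups[tuple(weights)].append(gene)
    return dict(groups)


def count_related_to(groups, study_index):
    """Count genes whose weight for the given study equals 1."""
    return sum(len(genes) for w, genes in groups.items() if w[study_index] == 1)


# Adaptive weights for three of the Table 6 example genes.
example = {
    "1449015_at": (0, 1, 1),
    "1416415_a_at": (0, 1, 1),
    "1415727_at": (1, 1, 1),
}
groups = group_by_weights(example)
print(groups[(0, 1, 1)])            # genes DE in liver and heart only
print(count_related_to(groups, 2))  # genes related to heart tissue
```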

To illustrate AW performance for genes that are regulated in the same
direction across data sets, details for five genes are presented in
Table 6. Four of the five methods identified the five example genes as
differentially expressed (the exception was minP). The first two genes
(1423407_a_at and 1418429_at) clearly show discordant regulation, with
moderated t-statistics of opposite sign in brown fat and heart. Even though
Pearson's method (PR) was specifically designed to detect concordant genes,
it failed to achieve this goal in this particular situation. In contrast,
our proposed AW method uses a post-hoc approach (Section 3, step IV) to
filter out discordant genes. Such a post-hoc procedure is not feasible for
EW, minP or PR because they do not indicate the studies in which a gene is
differentially expressed. For example, the AW method with concordance
filtering still identifies 1449015_at and 1416415_a_at as concordant DE
genes, even though the direction of regulation in the nonsignificant study
(brown fat) contradicts the two significant studies. The difference between
AW and the natural tendency of biologists to pick studies based on p-values
from individual analyses is illustrated by the fifth gene, 1415727_at,
which produces moderate signals for brown fat and liver and a very strong
signal for heart, so that the brown fat and liver signals can easily be
ignored after adjustment for multiple comparisons. In general, it is
difficult to decide whether it is a (0,0,1)- or (1,1,1)-type gene. Because
this gene is moderately significant in two studies and very significant in
a third, AW determines that combining results across all three studies
gives the best statistical significance, and that it should therefore be a
(1,1,1)-type gene.
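
The weight selection for the fifth gene can be reproduced approximately
with a short Python sketch. It rests on assumptions that should be stated
plainly: studies are treated as independent, and the p-value of
$U(w) = -\sum_k w_k \log p_k$ is taken from a Gamma$(\sum_k w_k, 1)$ null
distribution, as in the Appendix. The procedure in Section 3 may evaluate
significance differently (e.g., by permutation), so the numbers below are
illustrative only. The function names are hypothetical.

```python
import math
from itertools import product


def gamma_sf_int(shape, u):
    """P(Gamma(shape, 1) > u) for a positive integer shape (Erlang tail)."""
    term, total = 1.0, 1.0
    for i in range(1, shape):
        term *= u / i
        total += term
    return math.exp(-u) * total


def aw_search(pvals):
    """Return (weights, p-value) minimizing p(U(w)) over binary weights."""
    best = None
    for w in product((0, 1), repeat=len(pvals)):
        k = sum(w)
        if k == 0:
            continue  # at least one study must receive weight 1
        u = -sum(wi * math.log(p) for wi, p in zip(w, pvals))
        p_w = gamma_sf_int(k, u)
        if best is None or p_w < best[1]:
            best = (w, p_w)
    return best


# p-values in the order (brown fat, liver, heart), from Table 6.
print(aw_search([0.018, 0.014, 0.0008]))  # 1415727_at: weights (1, 1, 1)
print(aw_search([0.46, 0.0009, 0.011]))   # 1449015_at: weights (0, 1, 1)
```

Under these assumptions the (1,1,1) combination for 1415727_at gives a
smaller p-value than heart alone (0.0008), in line with the discussion
above.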

4.3. Prostate cancer and lung cancer studies. We applied the AW, EW,
minP and PR methods to three sets of prostate cancer data and three sets
of lung cancer data (Table 1). Some of the studies used cDNA technology
[Dhanasekaran et al. (2001), Luo et al. (2001) and Garber et al. (2001)],
while others used Affymetrix oligo-based technology [Welsh et al. (2001),
Bhattacharjee et al. (2001) and Beer et al. (2002)]. Probes were matched
across data sets according to their Entrez IDs; the intensities of multiple
probes matching the same ID were averaged. For the prostate cancer data
sets, comparisons were made between clinically localized cancer and benign
tissues. For the lung cancer data sets, we compared tissues from
adenocarcinoma patients with those from healthy donors.
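
The probe-matching step can be sketched as follows. This is a minimal
illustration, not the preprocessing code used for these data sets: the
function name `average_by_entrez`, the input layout (one list of sample
intensities per probe) and the Entrez IDs in the example are all
hypothetical.

```python
def average_by_entrez(probe_rows):
    """Average the intensities of probes that share an Entrez ID.

    probe_rows: iterable of (entrez_id, [intensity for each sample]).
    Returns a dict mapping entrez_id to the per-sample mean intensity.
    """
    sums, counts = {}, {}
    for entrez_id, values in probe_rows:
        if entrez_id not in sums:
            sums[entrez_id] = [0.0] * len(values)
            counts[entrez_id] = 0
        sums[entrez_id] = [a + b for a, b in zip(sums[entrez_id], values)]
        counts[entrez_id] += 1
    return {g: [v / counts[g] for v in total] for g, total in sums.items()}


# Two hypothetical studies; only genes present in both are kept for combining.
study_a = average_by_entrez([("1001", [1.0, 3.0]), ("1001", [3.0, 5.0]),
                             ("2002", [2.0, 2.0])])
study_b = average_by_entrez([("1001", [4.0, 4.0]), ("3003", [0.5, 1.5])])
matched = sorted(set(study_a) & set(study_b))
print(study_a["1001"], matched)
```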

The results shown in Figures 2 and 3 have characteristics similar to those
discussed in the mouse example above, with one exception: minP did not
perform as poorly as it did in Section 4.1. Compared to the other methods,
our proposed AW method identified much clearer patterns. Of the 722 genes
in Figure 2(a), 618 genes show consistent regulation across studies
[Figure 2(b)], so approximately 14% of the identified genes were discordant
across studies. Possible causes of discordant genes include mistaken gene
annotations in old array platforms [Dai et al. (2005)], differential probe
efficiencies, heterogeneous sample populations across studies and
nonspecific cross-hybridization. According to our findings, only moderately
concordant information existed across the three prostate cancer studies,
probably because (a) their sample sizes were small, or (b) they used
in-house cDNA arrays or commercial products that were still in the early
stages of development. Of the 618 concordant AW-detected genes, 130 genes
(21%) were consistent (1,1,1)-type biomarkers and 205 genes (33.2%) were
specific to one study only: 55 of the (1,0,0) type, 70 of the (0,1,0) type
and 80 of the (0,0,1) type. The EW, minP and PR methods all detected more
biomarkers than the AW method (924, 745 and 882, respectively). However, in
each case the detected biomarkers were difficult to interpret and follow
up, and none of the three methods can guarantee the detection of concordant
genes only. In summary, our findings suggest that results from individual
microarray studies require careful interpretation, and that integrative
analyses are appropriate as a validation tool.

Similar patterns and results were obtained when the four methods were
applied to the lung cancer studies (Figure 3). The AW method detected 366
genes, with 349 confirmed as concordant (only 4.6% were discordant,
compared to 14.4% in prostate cancer). Among the 349 concordant biomarkers,
99 were type (1,1,1) (28.4% compared to 21% in prostate cancer) and 96 were
specific to a single study (27.5% compared to 33.2% in prostate cancer):
7 type (1,0,0), 51 type (0,1,0) and 38 type (0,0,1). Overall, the lung
cancer studies had more biomarkers that were consistent in terms of
concordant up-regulation and down-regulation patterns, and fewer
single-study-specific biomarkers. These results match previous reports
showing better consistency among lung cancer studies than among prostate
cancer studies, possibly due to larger sample sizes, better gene
annotations, more specific disease subtype comparisons and better array
quality. For example, Bhattacharjee and Beer used Affymetrix platforms,
while Garber's data were generated in the lab of Pat Brown, the inventor of
cDNA arrays.

Fig. 2. Heatmaps of gene expression intensities for differentially
expressed genes identified by different methods in the prostate cancer
data sets.

Fig. 3. Heatmaps of gene expression intensities for differentially
expressed genes identified by different methods in the lung cancer data
sets.



5. Power and admissibility. In this section we drop the subscript $g$ for
genes and assume independence among studies when comparing five test
statistics (EW, AW, minP, maxP and PR) for $H_B$ at the univariate level.
The maxP statistic is included for demonstration purposes although it is
not targeted to $H_B$. To date, no best method for combining multiple
studies has been identified; the choice of a combined statistic must
therefore reflect specific biological purposes. Birnbaum (1954, 1955)
established general conditions for evaluating combined methods, including
monotonicity and admissibility. To compare several combined test
procedures, he considered a one-sample test of the mean of a Gaussian
distribution with known variance. We will use a similar two-sample test of
the means of two Gaussian distributions with known variance:

$$Z_k = \frac{\bar{X}_{2k} - \bar{X}_{1k}}{\sigma_k \sqrt{1/n_{k1} + 1/n_{k2}}}, \qquad k = 1, 2, \ldots, K, \tag{5.1}$$

where $\bar{X}_{1k} = (1/n_{k1}) \sum_{s=1}^{n_{k1}} X_{ks}$,
$\bar{X}_{2k} = (1/n_{k2}) \sum_{s=n_{k1}+1}^{n_{k1}+n_{k2}} X_{ks}$,
$X_{ks} \sim N(0, \sigma_k^2)$ when $1 \le s \le n_{k1}$ and
$X_{ks} \sim N(\theta_k, \sigma_k^2)$ when
$n_{k1} + 1 \le s \le n_{k1} + n_{k2}$. We will use the two-sided p-values
$P_k = \Pr(|Z| > |z_k| \mid \theta_k = 0)$ for study $k$, where $Z$ is a
standard normal random variable, to examine the acceptance regions of the
various combined test procedures. This simplified framework is used both
for the discussion of admissibility in the Appendix and for the power
comparisons of the five statistics. It is shown in the Appendix that AW,
EW, PR and minP are all admissible, but maxP is not.
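
For concreteness, the statistic in (5.1) and its two-sided p-value can be
computed as in the following Python sketch. The function names are
hypothetical, and the sign convention (second group minus first group, so
that a positive $\theta_k$ gives a positive $Z_k$) is an assumption; it
does not affect the two-sided p-value.

```python
import math
from statistics import NormalDist, fmean


def z_statistic(group1, group2, sigma):
    """Two-sample Z statistic with known common standard deviation sigma.

    Assumption: the difference is taken as mean(group2) - mean(group1).
    """
    se = sigma * math.sqrt(1 / len(group1) + 1 / len(group2))
    return (fmean(group2) - fmean(group1)) / se


def two_sided_p(z):
    """Two-sided p-value Pr(|Z| > |z|) under the standard normal null."""
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical data with n_k1 = n_k2 = 5 and sigma_k = 1.
z = z_statistic([0.0] * 5, [1.0] * 5, sigma=1.0)
print(z, two_sided_p(z))
```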



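5.1. Power comparison of EW, AW, minP, maxP and PR under $H_{Bh'}$.
Denote by $\Omega_0 = \{\theta_1 = \cdots = \theta_K = 0\}$ and
$\Omega_A = \{\text{at least one } \theta_k \neq 0\}$ (i.e., $H_B$) the null
and alternative hypotheses. Letting $\beta^{AW}(\theta; \alpha)$ be the
power of a test controlled at level $\alpha$ for the AW statistic given
$\theta \in \Omega_A$, we have

$$\beta^{AW}(\theta; \alpha) = \Pr(V^{AW} < c_\alpha^{AW} \mid \theta) = 1 - \int_{\Omega^{AW}} \prod_{k=1}^{K} p(P_k \mid \theta_k) \, dP_1 \cdots dP_K, \tag{5.2}$$

where $c_\alpha^{AW}$ is the solution of $v$ to the equation
$\Pr(V^{AW} < v \mid \Omega_0) = \alpha$,

$$\Omega^{AW} = \bigcap_j \{ p(U(w_j)) > c_\alpha^{AW} \} = \bigcap_j \Big\{ -\sum_{k=1}^{K} w_{jk} \log P_k < F^{-1}_{\mathrm{Gamma}(\sum_k w_{jk},\, 1)}(1 - c_\alpha^{AW}) \Big\}$$

and $F^{-1}_{\mathrm{Gamma}(\alpha, \beta)}$ is the inverse CDF of a Gamma
distribution with parameters $\alpha$ and $\beta$,
$w_j = (w_{j1}, \ldots, w_{jK})$, $w_{jk} \in \{0, 1\}$, $k = 1, \ldots, K$,
and the enumeration index $j$ exhausts all different weight vector
possibilities such that $\sum_{k=1}^{K} w_{jk} \ge 1$. Under the null
hypothesis, each individual p-value $P_k$ is uniformly distributed on
$[0, 1]$. The density of the p-value under the alternative is expressed as

$$p(P \mid \theta) = \left. \frac{p(x \mid \theta)}{p(x \mid 0)} \right|_{x = g(P)} \qquad (0 < P < 1), \tag{5.3}$$

where $x = g(P)$ indicates the solution of
$P = \int_x^{\infty} p(t \mid 0) \, dt$ [Pearson (1938)]. Similarly, the
power for EW and minP can be calculated by

$$\beta^{EW}(\theta; \alpha) = \int_{\Omega^{EW}} \prod_{k=1}^{K} p(P_k \mid \theta_k) \, dP_1 \cdots dP_K, \qquad \beta^{minP}(\theta; \alpha) = 1 - \prod_{k=1}^{K} \int_{c_\alpha^{minP}}^{1} p(P_k \mid \theta_k) \, dP_k,$$

where $\Omega^{EW} = \{ -\sum_{k=1}^{K} \log P_k > c_\alpha^{EW} \}$ is the
rejection region of EW, and the critical values are
$c_\alpha^{EW} = F^{-1}_{\mathrm{Gamma}(K, 1)}(1 - \alpha)$,
$c_\alpha^{minP} = F^{-1}_{\mathrm{Beta}(1, K)}(\alpha) = 1 - (1 - \alpha)^{1/K}$
and $c_\alpha^{maxP} = F^{-1}_{\mathrm{Beta}(K, 1)}(\alpha) = \alpha^{1/K}$.

In our simplified setting, the Z test in (5.1) is used for power
calculations; hence, the density of $P_k$ is

$$p(P_k \mid \theta_k) = \frac{1}{2} \exp\Big\{ \frac{C_k}{2} \big[ 2\Phi^{-1}(1 - P_k/2) - C_k \big] \Big\} + \frac{1}{2} \exp\Big\{ -\frac{C_k}{2} \big[ 2\Phi^{-1}(1 - P_k/2) + C_k \big] \Big\}, \tag{5.4}$$

where $C_k = \theta_k / (\sigma_k \sqrt{1/n_{k1} + 1/n_{k2}})$,
$k = 1, \ldots, K$. We consider $n_{k1} = n_{k2} = 5$ and $\sigma_k = 1$ so
that the effect size is represented by $\theta_k$ and power is evaluated
with varying effect sizes.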
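
As a sanity check on the density of the two-sided p-value in (5.4), the
Python sketch below evaluates it and integrates it over $(0, \alpha)$ with
a midpoint rule. The result should be close to the power of a single
two-sided Z test, $\Phi(C_k - z_{\alpha/2}) + \Phi(-C_k - z_{\alpha/2})$.
The function names and the grid size are our own choices for illustration,
and the numerical integral is only approximate near $P_k = 0$, where the
density is unbounded.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentrality(theta, sigma=1.0, n1=5, n2=5):
    """C_k = theta_k / (sigma_k * sqrt(1/n_k1 + 1/n_k2))."""
    return theta / (sigma * math.sqrt(1 / n1 + 1 / n2))


def p_value_density(p, c):
    """Density of a two-sided p-value when the Z statistic is N(c, 1)."""
    x = STD_NORMAL.inv_cdf(1 - p / 2)
    return (0.5 * math.exp(0.5 * c * (2 * x - c))
            + 0.5 * math.exp(-0.5 * c * (2 * x + c)))


def rejection_probability(alpha, c, n_grid=20000):
    """Midpoint-rule integral of the density over (0, alpha)."""
    step = alpha / n_grid
    return step * sum(p_value_density((i + 0.5) * step, c)
                      for i in range(n_grid))


c = noncentrality(1.4)                      # theta = 1.4, n = 5 per group
z_crit = STD_NORMAL.inv_cdf(0.975)
closed_form = STD_NORMAL.cdf(c - z_crit) + STD_NORMAL.cdf(-c - z_crit)
print(rejection_probability(0.05, c), closed_form)
```

With $C_k = 0$ the density reduces to 1 on $(0, 1)$, which is the uniform
null distribution of the p-value.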

The graphs in Figure 4 reflect a situation in which $K = 10$ for the
simplified alternative hypothesis $H_{Bh'}$ ($1 \le h \le K$), in which $h$
studies have nonzero effects. Studies with nonzero effect sizes share a
common effect size $\theta$. Power curves under $\theta \in \{1.2, 1.4\}$
and varying values of $h$ are displayed. Due to the difficulty of achieving
an exact power calculation for $K = 10$, we performed 10,000 simulations to
generate the power curves. EW and AW are calculated from one-sided p-values
for the purpose of comparability with PR, maxP and minP. In applications,
it is unlikely that the signs of the effects will be known; therefore,
two-sided p-values for maxP, minP, EW and AW are preferred. As expected,
the figure shows that minP is more powerful than EW when $h$ is small, and
EW is more powerful than minP when $h$ is large. On the other hand, AW
performs stably and comparably to the best method in situations involving
the two extremes. The performance of maxP further confirms Loughin's
conclusion that it has very low power unless $h = K$.

Fig. 4. Power analysis of EW, AW, minP, PR and maxP under $H_{Bh'}$,
$1 \le h \le K$. We compare power curves of the five methods combining
$K = 10$ studies. The X axis represents $h$, the number of studies that
have nonzero effects.
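
A power simulation of this kind can be sketched in Python as follows. This
is a simplified illustration, not the code used to produce Figure 4: it
covers only EW, minP and AW, uses one-sided p-values, calibrates the AW
critical value by simulating the null distribution, and uses fewer
replicates than the 10,000 reported above. All function names are
hypothetical. The AW search uses the fact that, for a fixed number $j$ of
included studies, the best binary weight vector selects the $j$ smallest
p-values.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gamma_sf_int(shape, u):
    """P(Gamma(shape, 1) > u) for a positive integer shape."""
    term, total = 1.0, 1.0
    for i in range(1, shape):
        term *= u / i
        total += term
    return math.exp(-u) * total


def one_sided_pvalues(rng, k, h, theta, n_per_group=5, sigma=1.0):
    """Simulate K one-sided p-values; the first h studies have effect theta."""
    c = theta / (sigma * math.sqrt(2 / n_per_group))
    return [STD_NORMAL.cdf(-rng.gauss(c if i < h else 0.0, 1.0))
            for i in range(k)]


def ew_pvalue(pvals):
    """Fisher's method: -sum(log p) follows Gamma(K, 1) under the null."""
    return gamma_sf_int(len(pvals), -sum(math.log(p) for p in pvals))


def minp_pvalue(pvals):
    """Tippett's method: the minimum p-value follows Beta(1, K)."""
    return 1 - (1 - min(pvals)) ** len(pvals)


def aw_statistic(pvals):
    """Smallest p(U(w)) over binary weights, via the sorted p-values."""
    u, best = 0.0, 1.0
    for j, p in enumerate(sorted(pvals), start=1):
        u -= math.log(p)
        best = min(best, gamma_sf_int(j, u))
    return best


def simulate_power(k=10, h=1, theta=1.4, alpha=0.05, n_sim=2000, seed=0):
    rng = random.Random(seed)
    # Empirical null distribution of the AW statistic gives its cutoff.
    null = sorted(aw_statistic(one_sided_pvalues(rng, k, 0, 0.0))
                  for _ in range(n_sim))
    aw_cut = null[int(alpha * n_sim)]
    hits = {"EW": 0, "minP": 0, "AW": 0}
    for _ in range(n_sim):
        p = one_sided_pvalues(rng, k, h, theta)
        hits["EW"] += ew_pvalue(p) < alpha
        hits["minP"] += minp_pvalue(p) < alpha
        hits["AW"] += aw_statistic(p) < aw_cut
    return {method: n / n_sim for method, n in hits.items()}


print(simulate_power(h=1))
print(simulate_power(h=10))
```

Because the AW cutoff is estimated from simulated null data, the AW power
estimates carry additional Monte Carlo error relative to EW and minP.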

6. Discussion. In this paper we described our proposal for an adaptively
weighted (AW) statistic for combining multiple studies, and reported our
findings after applying it to two sets of combined microarray studies.
Acknowledging that meta-analysis methods depend heavily on the biological
question being investigated, we formulated two statistical hypothesis
settings ($HS_A$ and $HS_B$) to identify differentially expressed genes
that are significant in either a subset or all of the data sets. Classical
EW, minP and our proposed AW methods were used to analyze $HS_B$.

According to our findings, AW, EW and minP are all admissible in simplified
scenarios. In terms of power analysis, EW was more powerful when all data
sets were significant, while minP was more powerful when only one or a
small number of data sets were significant. As a compromise between EW and
minP, the AW method performed close to the best method in either extreme
alternative hypothesis setting (Figure 4). Simulation results also
confirmed this robust property of AW (Tables 3-5). In applications, AW had
the additional advantage of categorizing differentially expressed genes by
their adaptive weights, thus providing a practical basis for further
biological exploration. The modified algorithm in Section 3, step IV, which
excludes discordantly regulated genes, was appealing for the specific
biological purpose of identifying all nondiscordant genes.

Fig. 5. Acceptance regions of the EW, AW, minP, PR and maxP statistics for
combining p-values from two independent studies when testing means of
Gaussian distributions with known variances.

In this project we restricted the adaptive weight search space to binary
(0 or 1) weights for purposes of computational convenience and biological
interpretability. For example, in Figure 1(b) the AW results support an
immediate categorization of detected biomarkers, as well as information on
similar or dissimilar differential gene expression between tissue pairs. As
shown for the EW results in Figure 1(c), Fisher's method generated a large
number of nontraceable biomarkers that were difficult to work with in
follow-up analyses. Theoretically, it is possible to extend the binary
space to unrestricted real-valued weights (i.e., positive weights that add
up to 1). However, such weights generate biomarker lists similar to those
generated by the EW method [Figure 1(c)]. In other words, using nonbinary
weights may be slightly superior statistically, but not biologically.

Our method has three limitations, which also suggest extensions for future
research. First, we assumed that all studies contain an identical matched
gene list with no missing values. In actual practice, separate studies to
be combined usually come from different microarray platforms. Requiring an
identical matched gene list and no missing values excludes many important
genes that appear in some studies but not in others; an extension that
allows for missing values is therefore needed. Second, we focused on
two-group comparisons in this paper, and made a modification in order to
limit detection to genes with concordant expression changes. To compare
more than two groups, the F-statistic and its variations can be applied;
the resulting p-values from F tests can be combined in the same way as
described for the algorithm in Section 3. However, small p-values across
studies do not guarantee concordant expression patterns. To address this
problem, we have developed a multi-class correlation approach [Lu, Li and
Tseng (2010)]. Third, our proposed method focuses on $HS_B$ rather than
$HS_A$, which is not the setting of interest in many biological
applications. Finally, the AW statistic can be extended from biomarker
detection to gene set enrichment analyses. Note that post-meta-analysis
enriched pathways (gene sets) are thought to be more supportive of
biological interpretations.

While we only considered combining multiple microarray studies in this
paper, the methods we described can easily be extended to combinations of
multiple genomic, epigenomic and/or proteomic studies, for instance, data
sets from SNP arrays, genome arrays, methylation arrays, proteomic
experiments and ChIP-on-chip experiments. Additional extensions and/or
alternative models are required to accommodate biological knowledge and to
address specific questions of interest.

APPENDIX: ADMISSIBILITY 

A test is considered admissible if it cannot be uniformly improved by any
other test. No single test has been accepted as the most powerful, even in
the simplified scenarios. Birnbaum gave a necessary and sufficient
condition (Theorem 1) for any test to be admissible in this situation.

Theorem 1 [Birnbaum (1954, 1955)]. Under $H_B$, and when the test statistic
belongs to the exponential family [e.g., equation (5.1)], a necessary and
sufficient condition for a combined test procedure to be admissible is that
the corresponding acceptance region is convex.

Since the acceptance regions of EW and minP are convex, both methods are
admissible; maxP is not. When proving that the PR method is admissible,
Owen (2009) clarified Birnbaum's (1954) misinterpretation of the PR method.
The acceptance regions of EW, minP, maxP, AW and PR on the plane of a pair
of Z statistics at level 0.05 are shown in Figure 5. When illustrating the
rejection regions of several common combined tests (including EW and minP),
Birnbaum showed a preference for EW because it appeared to be "fairly
sensitive in all directions." From Figure 5, it is clear that the PR method
prefers effects that show common directions in two studies, since the
rejection regions in the first and third quadrants are less stringent than
those in the second and fourth quadrants. Note that AW shares positive
aspects of both the EW and minP methods: it is generally more sensitive
than minP when parameters from both studies depart from 0, and more
sensitive than EW when only one of the parameters departs from 0. According
to the following corollary, AW is admissible: its acceptance region is an
intersection of convex sets and is therefore convex.



Corollary 1. The acceptance region of AW is convex and, thus, AW is
admissible under $H_B$ and assumption (5.1).



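Proof. Denote by $p_k = 2(1 - \Phi(|z_k|))$ the two-sided p-value, where
$\Phi(z) = \int_{-\infty}^{z} \phi(t) \, dt$ and $\phi(t)$ is the density of
the standard normal distribution. First we prove that
$f(z_k) = -\log(p_k) = -\log(1 - \Phi(|z_k|)) + C$ is convex. We have

$$f''(z) = \frac{\phi(|z|)}{[1 - \Phi(|z|)]^2} \big\{ \phi(|z|) - |z| [1 - \Phi(|z|)] \big\} \qquad \text{when } z \neq 0.$$

It is well known that the elementary upper bound for $1 - \Phi(x)$ is
$\phi(x)/x$, for $x > 0$. Thus, $f''(z) > 0$ when $z \neq 0$. Since $f(z)$
is continuous at $z = 0$, where its left derivative $-2\phi(0)$ does not
exceed its right derivative $2\phi(0)$, $f(z)$ is convex in $z$. Hence,
$f(z_1, z_2, \ldots, z_n) = -\sum_{k=1}^{n} \log(p_k)$ for any $n \ge 1$ is
convex, because the sum of convex functions is convex. For the AW
statistic, the acceptance region is
$\{z_1, z_2, \ldots, z_K : \min_{w} p(u(w)) > c\}$, where $p(u(w))$ is the
right-sided p-value of $U(w)$:

$$\Big\{ z_1, z_2, \ldots, z_K : \min_{w} p(u(w)) > c \Big\} = \bigcap_{w_k \in \{0,1\},\, 1 \le k \le K} \Big\{ z_1, z_2, \ldots, z_K : p\Big( -\sum_{k=1}^{K} w_k \log p_k \Big) > c \Big\}$$

$$= \bigcap_{w_k^j \in \{0,1\},\, 1 \le k \le K} \Big\{ z_1, z_2, \ldots, z_K : -\sum_{k=1}^{K} w_k^j \log p_k < \gamma_j \Big\}, \qquad j = 1, 2, \ldots, 2^K - 1,$$

where $\gamma_j$ is $F^{-1}_{\mathrm{Gamma}(\sum_k w_k^j,\, 1)}(1 - c)$. Each
set in the intersection is a sublevel set of a convex function and is
therefore convex. Thus, the acceptance region of AW is convex since the
intersection of convex sets is also convex. □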
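
The convexity of $f(z) = -\log p(z)$ used in the proof can also be checked
numerically. The Python sketch below is only a sanity check on a finite
grid, not a substitute for the argument above; the function names and the
grid are our own choices. It evaluates the second derivative given in the
proof and tests the midpoint inequality
$f((a + b)/2) \le (f(a) + f(b))/2$ for all pairs of grid points, including
pairs that straddle the kink at $z = 0$.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def f(z):
    """Negative log of the two-sided p-value, -log(2 * (1 - Phi(|z|)))."""
    return -math.log(2 * STD_NORMAL.cdf(-abs(z)))


def second_derivative(z):
    """f''(z) for z != 0, following the expression in the proof."""
    a = abs(z)
    tail = STD_NORMAL.cdf(-a)          # 1 - Phi(|z|)
    return STD_NORMAL.pdf(a) * (STD_NORMAL.pdf(a) - a * tail) / tail ** 2


def midpoint_convex(grid, tol=1e-12):
    """Check f((a + b) / 2) <= (f(a) + f(b)) / 2 for all grid pairs."""
    return all(f((a + b) / 2) <= (f(a) + f(b)) / 2 + tol
               for a in grid for b in grid)


grid = [i / 4 for i in range(-16, 17)]   # z from -4 to 4 in steps of 0.25
print(midpoint_convex(grid))
print(min(second_derivative(z) for z in grid if z != 0))
```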